ABSTRACT


This paper describes the measurement and analysis of hard CPU and memory
errors, and of system activity, at the Stanford Linear Accelerator Center
computational facility. Nearly 25 percent of the errors were estimated
to be permanent. The occurrence of a failure was found to be strongly
correlated with the level and type of workload prior to the occurrence
of the failure. For example, it is shown that the risk of a permanent
error increases in a non-linear fashion with the amount of interactive
processing. The observed tendency is present in three years of load
data. This observation is significant because a load-failure
relationship found at the CPU level must, in our view, be considered
fundamental. In addition, the fact that most of the errors are permanent
provides new information on these error types, viz. their load-dependent
behavior. Our analysis procedure, used on the SLAC data, has been
validated on an artificially created data base seeded with failures.

Keywords: Workload and failure measurement, data analysis, statistical
models.

1. INTRODUCTION


The highly interactive and diverse nature of modern-day systems has made
high reliability a central issue in computer system design. It is not,
in general, feasible to guarantee a perfect system, either in hardware
or in software. Accordingly, depending on the nature of the application,
it is important to design into the system the ability either to
continue operation in the event of a failure or to react to a failure in
a predictable manner.

Theoretical  models  can  only  deal  with  a  restricted  class  of  problems. 
Most  often  it  is  the  problems  outside  the  range  of  theoretical  models 
which  cause  the  most  severe  malfunctions.  Accordingly,  at  this  stage 
there  is  no  better  substitute  for  results  based  on  actual  measurements 
and  experimentation.  An  experimental  study  provides  not  only  a  view  of 
the  end  product  but  also  gives  some  insight  into  persistent  problems. 
This  information  can  be  very  valuable  in  designing  new  systems. 

This paper describes the measurement and analysis of hard CPU and
memory errors, and system activity at the Stanford Linear Accelerator
Center (SLAC) computational facility. The authors' approach has been to
start with a substantial body of empirical data on system load and
failures. On the basis of these measurements several experiments were
conducted to examine the dependence of hard failures on system activity.
The salient features of the measurement process and important results
are outlined below:

1. The present study concentrates on hard CPU related errors. A
measurable number of the failures (between 15 and 25 percent)1 were
estimated to be hard failures (CPU and main memory).

2. The measurement process is automatic; it captures a detailed
internal view of the system, especially under failure conditions.

3. From the measurements, a completely new data base of failures and
workload was established. The workload and failure data were
combined in order to match failures with workloads at the times
of failure.

4. The measurements and statistical experiments clearly demonstrate
a non-linear increase in the risk of hard CPU related errors
due to increased values of workload variables. Examples are CPU
utilization, input/output rate, and interrupt rates.


1 Between 75 and 85 percent of all errors were temporary (transient or
intermittent) and are discussed in [Iyer 82b] and [Rossetti 81].

A representative measurement is illustrated in Fig. 1, which shows
how an increase in the system CPU usage, SYSCPU (a measure of the
system overhead; a fraction between 0 and 1), can result in higher risk of
hardware failures in the CPU and main memory. The horizontal axis is
the workload variable; the vertical axis is the risk of error. Modeling
details will be given later in this paper.

1.1 RELATED RESEARCH AND MOTIVATION

There is now considerable experimental evidence to show that computer
reliability is a dynamic function of system activity (as measured by the
workload). A number of studies [Butner 80], [Iyer 82a, b] and [Castillo
80, 81] provide statistical evidence on a number of machines to support
this observation. Even though the exact nature of this dependency is
not fully understood, it would appear that computing systems, which
need maximum reliability at their peak load, require a re-evaluation of
their reliability projections.

An important and as yet unanswered question is whether an increased
level of system activity results in an increased level of hardware
failures. In particular, it is important to determine whether hard failures
in logic elements (CPU and storage) are also workload dependent, i.e.,
does higher system activity result in a higher level of CPU and memory
failures?

Some evidence to this effect was available from an early analysis of
failures on the SLAC Triplex [Iyer 82a]. The study found a strong
correlation between the occurrence of hard failures and the load on the
system, as measured by variables such as the paging rate and the jobstep
processing rate. All failures were considered, not simply the ones
which led to system service interruptions. Most importantly, the effects
were such that the average failure rate for various system components
varied cyclically over a band of significant width as determined by the
daily load variations. Fig. 2 below is a representative histogram, from
that study, of all hard CPU failures plotted by the hour of day,
averaged over 1978.

Subsequently a more detailed and accurate study was performed on all
CPU errors [Iyer 82b]. It was found that all errors which affected the
CPU correlated strongly with system activity. The large majority of
these errors, however (75 - 85 percent), were temporary. More recent
studies conducted on the IBM 3081 at SLAC found a similar behaviour with
software related failures on VM/370 [Rossetti 82]. Additional
substantiation of these results came from experimental studies on DEC
systems reported in [Castillo 80].

There has been some effort at modelling the observed load/failure
relationship. Possible cause-effect scenarios are discussed in [Butner
80] and [Iyer 82a]. Castillo and Siewiorek [Castillo 81, 82] have
proposed the use of a doubly-stochastic Poisson process to model the
cyclic load-failure relationship. The model assumes that the instantaneous
failure rate can be described by a cyclostationary Gaussian process. In
[Gunther 80] a novel theoretical model for an apparent dependency of
failure on load, based on a random walk formulation, is described.
There is no doubt that more detailed experimental results are necessary
before a clear understanding of the observed behaviour is possible.

The next section discusses the failure and workload measurements
taken and briefly presents the organization of the data. Subsequent
sections describe the procedures employed to analyse hard failures and
present new results. Finally, the paper summarises the important
results and highlights the conclusions that can be drawn from them.

2. MEASUREMENTS


2.1  ERROR  MEASUREMENT 

As  stated  earlier,  the  present  study  uses  the  most  detailed  data  from 
the  log  maintained  by  the  operating  system  as  errors  are  detected  by  the 
hardware  and  recorded  by  the  software.  High  level  system  behavior,  as 
seen  by  the  computer  operator  and  users,  is  not  directly  measured. 
Instead,  there  is  much  information  on  hardware  errors,  both  permanent 
and  non-permanent  (transient  and  intermittent),  as  they  occur  in  the 
detailed  operation  of  system  components. 

The SLAC system, during the period of our study, consisted of two IBM
370/168 mainframes and an IBM 360/91 connected in a triplex mode. The
data for our study, which consisted of three years of measurements
(1979, 1980, and 1981), came from the two IBM 370/168 mainframes. The
log referred to above is commonly called the "EREP" log, from the
Environmental Recording Editing and Printing program used to accumulate
and format it for maintenance [IBM 79].

Errors  in  IBM  370  systems  are  classified  into  three  major  types: 

1.  CPU  Errors  -  In  the  central  processor  and  storage. 

2. Channel Errors - In I/O channels and associated interfaces.

3.  Outboard  Errors  -  In  any  device  beyond  the  channel-control  unit 
interface,  i.e.  all  errors  in  I/O  devices. 

For each error, whether recoverable or not, the operating system
creates a time-stamped record describing the error and providing relevant
information on the state of the machine. As an example, for a CPU
error, the state information might include the contents of all internal
registers and diagnostic information collected by the hardware (such as
parity indicators and error flags). At SLAC this information is
collected on a daily basis and archived for many years.

Since the EREP log does not specifically provide information on
hardware failures, it is necessary to estimate this information from the
data. The following rule was used to estimate a hardware failure: if
the machine check condition interrupted the CPU and re-occurred three
times or more in sequence, the failure was considered to be a hardware
failure in the CPU or main memory. The vast majority of these failures
were storage errors. A comparison with the maintained repair log
showed good agreement. An important reason why this heuristic works is
that, at high workloads, a bad memory location is very likely to be
rediscovered without much latency. In many cases it was found that this
led to a system termination. A sample of the hardware failure data
obtained on this basis is shown in Table 1.
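The estimation rule above can be sketched in Python as follows. This is a minimal illustration and not the procedure actually run on the EREP data: the record layout (time, CPU id, error signature), the function name `flag_hard_failures`, and the reading of "in sequence" as consecutive records with the same CPU and signature are all assumptions made for the example.

```python
from itertools import groupby


def flag_hard_failures(records, min_repeats=3):
    """Return runs of identical machine checks long enough to count as hard.

    records: list of (time, cpu_id, signature) tuples, already ordered by
    cpu_id and then by time. The three fields are hypothetical names.
    """
    hard = []
    # groupby yields maximal runs of consecutive records sharing a key.
    for (cpu_id, signature), run in groupby(records, key=lambda r: (r[1], r[2])):
        run = list(run)
        if len(run) >= min_repeats:
            hard.append({
                "cpu_id": cpu_id,
                "signature": signature,
                "first_time": run[0][0],
                "last_time": run[-1][0],
                "count": len(run),
            })
    return hard


# Made-up sample: three storage checks in a row, one parity check,
# then only two storage checks (too few to be flagged).
sample = [
    (100, "A", "storage"), (101, "A", "storage"), (103, "A", "storage"),
    (500, "A", "parity"),
    (900, "A", "storage"), (905, "A", "storage"),
]
hard_failures = flag_hard_failures(sample)
```

Only the first run of three storage checks is flagged here; the threshold is a parameter so that the sensitivity of the heuristic can be examined.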
TABLE 1

Sample error data (EREP)

2.2 WORKLOAD MEASUREMENT

Since errors in processors occur fairly infrequently (on the order of
once a day for our measurements), correlation with workload requires
long term workload figures. The workload data comes from two sources:
the built-in system utilization facility, and a software monitor written
specifically for this study. They are discussed below.

The operating systems in the processors measured use IBM's System
Management Facilities (SMF) for usage accounting. SMF was originally
designed to provide accounting information, but it has evolved over the
years to include more general performance measurement information. SMF
is discussed exhaustively elsewhere [IBM 73], [Butner 80] and will not
be detailed here.

In general, SMF data consists of records giving resource utilization
figures for jobs, files, I/O devices, and a potpourri of statistics
gathered and written on a periodic basis. For this work we use the type
4 (Step) record, which holds statistics for each job step as it
completes execution, and the type 1 (Wait) record, written roughly every 10
minutes, which summarises global system utilization during that 10
minute period. With careful processing SMF can provide excellent workload
statistics, especially when high resolution results are not needed.

To obtain more detailed information about transient behavior in the
CPU we implemented an interrupt rate monitor, called INTRACK. There are
four classes of interrupts in the IBM 370 architecture:2

1. External (EXT) - Used by the operating system for clocks and
inter-CPU communication.

2. Supervisor Call (SVC) - Caused by any SVC instruction. Used for
operating system services, such as memory allocation,
synchronization, I/O, timing, etc.

3. Program (PROG) - Program traps due to arithmetic conditions
(e.g. division by zero), invalid operations, or page faults.

4. Input/Output (I/O) - From completion of I/O operations.

The interrupt monitor (INTRACK) archived the interrupt data along
with the SMF data described above. Table 2 summarizes the sources of
data for the workload information.


2 Machine check interrupts are not considered here because they are
already collected in the EREP data.


TABLE 2

Input data for workload variables.

Record    When generated                  Contents used
-------   -----------------------------   -------------------------------
Step      At end of each batch job step   Accounting and job usage data,
                                          e.g. CPU time, no. of I/Os,
                                          memory usage.
Wait      Approx. every 10 minutes        CPU wait time during preceding
                                          10 minute period.
INTRACK   Normally every 10 minutes       Contents of four cumulative
          (but settable)                  interrupt counters for:
                                          External, SVC, Program, I/O.
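Because INTRACK records cumulative counters, a rate in interrupts per second would have to be obtained by differencing successive samples. The following Python sketch shows one way this could be done; it is an assumption about the conversion, not a description of the actual monitor. The sample layout and the name `interrupt_rates` are hypothetical, and intervals in which a counter goes backwards are simply skipped.

```python
def interrupt_rates(samples):
    """Convert cumulative counter samples into per-second rates.

    samples: list of (time_seconds, counters) pairs in time order, where
    counters maps a class name (e.g. "EXT", "SVC", "PROG", "IO") to a
    cumulative count. Returns a list of (start_time, end_time, rates).
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # ignore out-of-order or duplicate samples
        if any(c1[k] < c0[k] for k in c0):
            continue  # a counter went backwards (assumed reset); skip interval
        rates.append((t0, t1, {k: (c1[k] - c0[k]) / dt for k in c0}))
    return rates


# Made-up counter readings taken 600 seconds apart.
demo = interrupt_rates([
    (0,    {"EXT": 0,    "SVC": 0,     "PROG": 0,   "IO": 0}),
    (600,  {"EXT": 6000, "SVC": 30000, "PROG": 600, "IO": 12000}),
    (1200, {"EXT": 9000, "SVC": 90000, "PROG": 900, "IO": 18000}),
])
```

Each returned interval carries its own start and end time, so a settable sampling period needs no special handling.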

3.  OVERVIEW  OF  THE  MEASUREMENT  SYSTEM 
An  objective  of  the  measurement  system  was  to  make  data  management  as 
automatic  as  possible  so  that  it  is  unnecessary  to  know  the  particulars 
of  operating  systems,  software  monitors,  record  formats,  and  the  like. 
The  Statistical  Analysis  System  (hereafter  called  SAS)  [SAS  79]  provided 
a  rich  environment  for  data  handling,  in  addition  to  its  procedures  for 
statistical  analysis.  Once  a  few  programs  were  written  to  capture  and 
reduce  the  raw  data,  the  information  was  immediately  built  into  SAS  data 
bases  (called  SAS  data  sets),  on  which  the  full  power  of  SAS  could  be 
used  to  sort,  select,  merge,  and  extract  information.  More  than  50  SAS 
programs,  some  very  simple,  were  written  to  perform  a  variety  of  data 
handling operations on the data bases. This section discusses the
system as a whole, describing the flow of data in general terms. Later,
important components such as error clustering and workload smearing are
covered in detail.


The transformation of raw workload and error data into usable data
bases for analysis is performed by a collection of programs, some
written in PL/I and many written in SAS. Refer to Figure 3 for the
organization of these processors and the flow of data through them as
they are described in the following sections.

Figure 3: Detailed data flow in the measurement system

3.1 PROCESSING THE WORKLOAD DATA

Workload processing begins with a program written to select and
condense a specified set of SMF record types. This program is used to
process the thirty reels of tape comprising the archived SMF data from
1979 to the present.

3.1.1  Five  minute  intervals  and  smearing 

A number of workload variables are defined to provide estimates of
various characteristics of system load throughout the three year
measurement period. They are summarized in Table 3.


TABLE 3

Definitions of workload variables.

Name      Units    Derived From            Indicates
-------   ------   ---------------------   -------------------------
COREQ     KBytes   Summed Smear            Batch memory requests
COREU     KBytes   Summed Smear            Batch memory usage
VOLWAIT   sec.     Summed Prorated Smear   Batch I/O wait time
EXCP      1/sec    Summed Prorated Smear   Batch induced I/O load
PAGEI     1/sec    Summed Prorated Smear   Batch paging (in)
PAGEO     1/sec    Summed Prorated Smear   Batch paging (out)
BATCPU    fract.   Summed Prorated Smear   Batch CPU usage
SYSCPU    fract.   SMF Wait, BATCPU        Nonbatch CPU, Ovhd., etc.
TOTCPU    fract.   SMF Wait, Smeared       Overall CPU load
EXT       1/sec    INTRACK                 Timer and clock activity
SVC       1/sec    INTRACK                 Overall O.S. activity
PROG      1/sec    INTRACK                 Paging/prog. exceptions
I/O       1/sec    INTRACK                 Overall I/O activity

The workload time granularity was defined to be five minutes, meaning
that for each five-minute period from January 1979 a vector of 13
workload variables was created. The process described below is applied to
each of the variables. Essentially, the process takes what information
is available in a record and distributes it into the time slots the
record describes.

Each input record provides a starting time, an ending time, and a
value for one or more load measures. Each of these measures is
"smeared" into the five-minute bins defined by the starting and ending
time of the event, either on a proportional basis (for variables
representing counts or times), or directly (for "level" variables, such as
memory usage). The algorithm also takes care of the subtle handling of
partial bins at the interval endpoints, in addition to the case where
both endpoints lie somewhere in the same bin. For these cases the
amount accumulated into the bin is weighted by the fraction of time
spent in the bin. Figure 4 presents an actual numerical example with
four jobs overlapping in various ways. Notice that the height of each
bin is the sum of the time averaged values of input values entering that
bin. This averaging is similar to approximations that occur in
numerical integration problems.
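The smearing step can be illustrated with a short Python sketch. It is written from the description above and is not the SAS or PL/I code used in the study; the function name `smear`, its parameters, and the use of a unit bin width for the Figure 4 jobs are assumptions. Count and time variables are first converted to a rate over the record's elapsed time, level variables are used as given, and each bin receives the rate weighted by the fraction of the bin the record covers.

```python
import math


def smear(records, bin_width, n_bins, level=False):
    """Distribute (start, end, value) records over fixed-width time bins.

    level=False: value is a count or time, spread in proportion to overlap.
    level=True:  value is a level (e.g. memory in use), time-weighted per bin.
    """
    bins = [0.0] * n_bins
    for start, end, value in records:
        elapsed = end - start
        if elapsed <= 0:
            continue  # zero-length records carry no time to distribute
        rate = value if level else value / elapsed
        first = max(0, int(start // bin_width))
        last = min(n_bins - 1, int(math.ceil(end / bin_width)) - 1)
        for b in range(first, last + 1):
            lo = max(start, b * bin_width)
            hi = min(end, (b + 1) * bin_width)
            if hi > lo:
                # weight by the fraction of the bin covered by this record
                bins[b] += rate * (hi - lo) / bin_width
    return bins


# The four jobs of Figure 4: (start time, end time, CPU time).
jobs = [(0.5, 4.7, 2.0), (3.3, 9.5, 7.0), (4.1, 5.8, 1.2), (7.1, 7.8, 0.3)]
cpu_bins = smear(jobs, bin_width=1.0, n_bins=10)
```

With a unit bin width the bin heights add up to the total CPU time of the four jobs, which is a convenient check that nothing is lost at partial bins.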

Example Smearing of the CPU Time of Four Jobs

Job   Start Time   End Time   Elapsed Time   CPU Time   CPU/Elapsed
---   ----------   --------   ------------   --------   -------------
A     0.5          4.7        4.2            2.0        0.46
B     3.3          9.5        6.2            7.0        1.13
C     4.1          5.8        1.7            1.2        0.71
D     7.1          7.8        0.7            0.3        0.43 (0.30)*

* Since job D is completely within a bin, its value is prorated into
that bin's sum.

[Plot "Smearing Example"; horizontal axis: Time (Units From Table), 0 to 10]

Figure 4: Example of Smearing Algorithm

As stated earlier, the smearing is done one month at a time, with
approximately 8640 bins per month, depending on the number of days in
the month. Finally, the estimates are concatenated into one-year groups
to form the "Five-Minute Smeared Data." For example, a complete day of
smeared points (the 288 five-minute bins from Monday, January 5, 1981)
for two variables is given in Figure 5. Each small step in the figure
is a five-minute average; the solid upper line represents percent CPU
busy, the dotted lower line is batch CPU. The plot shows the familiar
early morning lull between 5 and 6 am with a dramatic climb to full
utilization at about 10 am. Notice that in the evening, from about 10
o'clock on, batch work forms most of the CPU load, while during the day
it is only in the 35 to 40% range, with the remainder going to
timesharing and overhead. It is also interesting to note that at a few
rare points batch CPU seems to be greater than the total. This is due to
the averaging algorithm's smearing of a job's CPU usage evenly over the
job's duration, while the total CPU figure is derived from a 10-minute
global system total.

To study longer-range loading effects we also built a data base of
one-hour smeared workload vectors. Each one-hour point is derived from
the five-minute smeared data by averaging the twelve five-minute points
in that hour and tagging the new point with the starting time of that
hour. There are 8760 such vectors in a non-leap year. Another reason
for creating the one-hour data is to test whether system crashes
occurring soon after CPU failures cause the five-minute averages in the
period preceding the failure to be artificially decreased. This could
happen because jobs executing at the time of the crash would not
contribute to the smeared totals as they should. A preliminary analysis
showed this not to be a problem.
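The reduction from five-minute to one-hour points amounts to a block average, sketched below in Python. The function name `hourly_average` and the decision to drop an incomplete trailing hour are assumptions of this sketch.

```python
def hourly_average(five_minute_values, points_per_hour=12):
    """Average consecutive groups of five-minute points into one-hour points.

    The point at index h describes the hour whose first five-minute bin is
    at index h * points_per_hour. An incomplete trailing group is dropped.
    """
    n_hours = len(five_minute_values) // points_per_hour
    return [
        sum(five_minute_values[h * points_per_hour:(h + 1) * points_per_hour])
        / points_per_hour
        for h in range(n_hours)
    ]


# Made-up day of 288 five-minute points: half load, then full load.
one_day = [0.5] * 144 + [1.0] * 144
hourly = hourly_average(one_day)
```

A day of 288 five-minute points yields 24 hourly points, and 365 such days give the 8760 hourly vectors of a non-leap year.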

[Plot; horizontal axis: Time of Day (Hours). Dotted line: Smeared Batch
CPU Time (Percent). Solid line: Smeared Total CPU Time (Percent).]

Figure 5: One Day of Batch CPU/Total CPU Data

3.2  PROCESSING  THE  ERROR  DATA  (BUILD) 

This section presents the method used to process raw errors into the
data base used for analysis. A SAS program, called BUILD, performs the
following steps:

(i) Select: The raw EREP data includes CPU, channel, and device errors
for all equipment in the installation. Only CPU (Machine Check) errors
on the two 370/168s are selected for analysis.

(ii) Decode and Classify: In each MCH record there are a number of bits
describing the type of failure, its severity, and the result of hardware
and software attempts to recover from the problem. These bits are
decoded into classes meaningful to this analysis and analyzed in later
processing. (General machine check status indicators provided by the
hardware are described fully in the System/370 Principles of Operation
[IBM 81].)

(iii) Sort By Processor and Time: To facilitate clustering in the next
step it is necessary to sort the data by CPU id (serial number) and time
of failure within CPU id.

(iv) Cluster: Errors occurring within 5 minutes of each other were
coalesced. For each error point, the following test was performed:

    IF (error type) = (error type of previous error)
       AND (time away from previous error) ≤ 5 minutes
    THEN (fold this error into the cluster being built)
    ELSE (start a new cluster).

The result is a set of clustered errors for each year. Associated with
each cluster is information consisting of error classifications, number
of points in the cluster, time of first and last errors in the cluster,
and a variety of status data provided by the hardware and operating
system.
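The sort and cluster steps can be sketched in Python as follows. This is an illustration of the test given above, not the BUILD program itself; the tuple layout, the function name `cluster_errors`, and the explicit check that errors from different CPUs are never merged are assumptions.

```python
def cluster_errors(errors, max_gap_seconds=300):
    """Coalesce errors of the same type that follow each other closely.

    errors: iterable of (cpu_id, time_seconds, error_type) tuples.
    The gap is measured from the previous error, so a long burst can
    chain into a single cluster.
    """
    clusters = []
    previous = None
    for cpu_id, t, error_type in sorted(errors):  # by CPU id, then time
        same_cluster = (
            previous is not None
            and previous[0] == cpu_id             # assumption: per-CPU clusters
            and previous[2] == error_type
            and t - previous[1] <= max_gap_seconds
        )
        if same_cluster:
            clusters[-1]["npoints"] += 1
            clusters[-1]["last"] = t
        else:
            clusters.append({"cpu_id": cpu_id, "error_type": error_type,
                             "npoints": 1, "first": t, "last": t})
        previous = (cpu_id, t, error_type)
    for c in clusters:
        c["span"] = c["last"] - c["first"]
    return clusters


# Made-up error points (CPU id, time in seconds, error type).
demo_clusters = cluster_errors([
    ("A", 0, "storage"), ("A", 120, "storage"), ("A", 400, "storage"),
    ("A", 800, "storage"), ("B", 810, "storage"), ("A", 830, "parity"),
])
```

The `npoints` and `span` fields correspond to the NPOINTS and SPAN quantities associated with each cluster.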

Some interesting things can be learned from a cursory analysis of the
clusters derived as stated. Summary statistics for the number of points
in a cluster (NPOINTS) and time spanned by a cluster (SPAN) are shown
below in Table 4. Table 4 shows clearly that the clustering algorithm
is having an effect by gathering long bursts of failures into a few
large clusters, indicated by the maximum of 192 points and 1310 second
time span. The table also shows that lone failures predominate, with a
median cluster size of one and time span of zero, showing that the
clustering algorithm is not artificially forcing them together. The
accompanying bar charts also show this behavior. Clustering is important
in the load-failure analysis to avoid biasing the results with repeated
errors from the same failing component.


TABLE 4

Clustering Statistics

Original Errors: 1,903    Clustered Errors: 456 (76% reduction)

                  NPOINTS   SPAN (seconds)
Mean              4.17      20.44
Median            1         0
90th Percentile   5         48.6
Minimum           1         0
Maximum           192       1310

[Bar charts: "Cluster Size Distribution" and "Time Span Distribution";
the latter's horizontal axis is Time Span (seconds), 0 to 40, with the
last bar covering 40 and over.]


3.3 COMBINING WORKLOAD AND ERROR DATA (MATCH)

The final and most important step of the data base building process
is the matching of errors and workload. By matching we mean the
combining of each error point with information on system workload at the
time of the error. The clustered error points are processed sequentially
and for each point: (1) the time of the five-minute interval preceding the
error is calculated, and (2) used as a key to locate its corresponding
workload observation. Then (3) the vector of workload variables from
that observation is merged into the error observation.

In order to determine the load at the time of failure, the 5-minute
load averages (which we refer to as smeared averages) were merged with
the EREP log. The load at failure was taken to be the load in the five
minute interval prior to the failure, to eliminate perturbations from
system error recovery or a system crash. The matching is shown in
Figure 6.

Note that the interval containing the failure is not used because of the
measurement distortion that can be caused by error recovery activities,
and the fact that the system may not continue to run after the error.
Also, the exigencies of a system crash may prevent the operating system
from gathering workload and accounting statistics.

In  the  case  of  one-hour  averaged  workload  measurements,  the  algorithm 
is  the  same  except  that  the  previous  hour's  load  is  used. 
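The matching step can be sketched in Python as a keyed lookup. The sketch is an illustration only: the function name `match_errors_to_workload`, the dictionary layouts, and the use of the first error time of a cluster as "the time of the error" are assumptions.

```python
def match_errors_to_workload(clusters, workload, interval_seconds=300):
    """Attach to each error the workload vector of the preceding interval.

    clusters: list of dicts with a "first" time in seconds.
    workload: dict mapping interval start time -> dict of workload variables.
    """
    matched = []
    for cluster in clusters:
        # start of the interval that contains the error
        containing = (cluster["first"] // interval_seconds) * interval_seconds
        key = containing - interval_seconds  # the interval before the failure
        load = workload.get(key)
        if load is None:
            continue  # no workload observation exists for that interval
        matched.append({**cluster, "load_interval": key, **load})
    return matched


# Made-up workload vectors keyed by interval start time (seconds).
workload = {
    0:   {"TOTCPU": 0.40, "SYSCPU": 0.10},
    300: {"TOTCPU": 0.90, "SYSCPU": 0.55},
    600: {"TOTCPU": 0.20, "SYSCPU": 0.05},
}
matched = match_errors_to_workload([{"first": 650}, {"first": 100}], workload)
```

Calling the same function with `interval_seconds=3600` gives the one-hour variant, in which the previous hour's load is used.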


3.4 SUMMARY OF DATA BASE

Summarising the above presentations, the following major sets of data
were created:

• Clustered and unclustered "pure" errors - from which standard
failure analysis can be drawn to obtain a number of statistics, e.g.
mean time to failure, hazard with time, etc. See [Shooman 1968]
for more information.

•  Three  years  of  workload  information  -  also  useful  for  studies  not 
necessarily  related  to  reliability.  These  points  exist  in  both 
five-minute  and  one-hour  granularities. 

•  Errors  matched  with  workload  -  in  both  the  five-minute  and  one-hour 
forms.  These  observations  can  be  used  to  study  the  connection 
between  load  and  errors  in  large  computer  systems. 


4.  ANALYSIS 


4.1 WORKLOAD AND ERROR ANALYSIS

The data consisted of three years of load/failure measurements: 1979,
1980 and 1981. The 1981 data contain additional measurements made by
our special purpose interrupt monitor. Initially, we analyzed each year
separately. Since there was no significant difference in the 1979 and
1980 results, it was considered appropriate to combine the corresponding
load-failure data. Of the thirteen workload measures collected for the
study, four were chosen to be studied for 1979 and 1980. They were:

1. COREU - The sum of memory allocated by batch jobs (K bytes).

2. EXCP - The I/O initiation rate by batch jobs (I/Os per second).3

3. SYSCPU - CPU utilization for system, i.e. non-batch, tasks (a
fraction between 0 and 1).

4. TOTCPU - Total CPU usage (a fraction between 0 and 1).

For 1981 the following interrupt measurements were also included:

1. SVC - Supervisor calls (rate per second).

2. IO - I/O interrupts, completion of I/O operations (rate per second).

3. PROG - Program interrupts (rate per second).

Measures such as SYSCPU and IO provide a measure of the system
interactive load, while measures such as TOTCPU provide a general view
of the CPU usage. The variable "BATCPU", derived from the difference
between TOTCPU and SYSCPU, is a direct measure of batch usage.

Recall that the data base developed contains not only the values for
the specified workload variables to a five minute resolution but also
the values of the same variables matched with failure times. From these
data two types of distributions were developed. The first, ℓ(x), is
simply the distribution of the workload variable in question:

    $\ell(x) = \Pr\{\text{workload} = x\}$ *


3 EXCP is an acronym for "Execute Channel Program."

* The workload (or load) is assumed to be a discrete random variable for
this discussion.


The second is the joint distribution of failure and the workload
measure:

    $f(x) = \Pr\{\text{failure occurs and load} = x\}$

In this expression, failures and load values are represented as they
occur on an actual system, where favored loads contribute more to the
distribution than loads of low probability. To remove this effect we
divide f(x) by the associated load probability ℓ(x). Using the well
known notion of a conditional probability distribution [Feller 68] we
write

    $g(x) = \Pr\{\text{failure occurs} \mid \text{load} = x\} = \dfrac{f(x)}{\ell(x)}$

Therefore g(x) can be thought of as the probability of a failure at a
given load when all loads are equally represented; it is the conditional
failure probability. In the figure g(x) represents the conditional
probabilities arranged by increasing x (workload). Note that, since each
of these probabilities is calculated independently, g(x) is not a
probability distribution in the regular sense of the term.*


* A commonplace analogy to illustrate the above distinction is that
automobiles travelling at 150 mph have a higher probability of
accident than those travelling at 55 mph. However, there are far more
accidents for autos going 55. To obtain an accurate representation of
the risks involved in travelling at high speed, we must divide the
number of accidents occurring at each speed by the number of autos
travelling at that speed.
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Figures  7  and  8  depict  the  Jt.  f,  and  g  distributions  of  System  CPU 
(SYSCPU)  and  I/O  and  Total  CPU  (TOTCPU)  for  1981. 
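
The three distributions can be tabulated directly from the matched
samples. The following is a minimal sketch, assuming each five-minute
sample is available as a (load bin, failure flag) pair; the function
name, the data layout, and the toy numbers are illustrative assumptions
and are not taken from the SLAC data base.

```python
from collections import Counter

def load_failure_distributions(samples):
    """Tabulate l(x), f(x), and g(x) from (load, failed) samples.

    `samples` is an iterable of (load_bin, failed) pairs, one per
    measurement interval; this layout is an assumption of the sketch.
    """
    samples = list(samples)
    n = len(samples)
    load_counts = Counter(load for load, _ in samples)
    fail_counts = Counter(load for load, failed in samples if failed)

    # l(x) = Pr{workload = x}
    l = {x: c / n for x, c in load_counts.items()}
    # f(x) = Pr{failure occurs and load = x}
    f = {x: fail_counts.get(x, 0) / n for x in load_counts}
    # g(x) = Pr{failure occurs | load = x} = f(x) / l(x)
    g = {x: f[x] / l[x] for x in load_counts}
    return l, f, g

# Toy data in the spirit of the automobile analogy: most samples sit at
# the low value, so it collects more failures in absolute terms, yet the
# conditional failure probability is higher at the high value.
toy = ([(55, False)] * 980 + [(55, True)] * 20
       + [(150, False)] * 15 + [(150, True)] * 5)
l, f, g = load_failure_distributions(toy)
```

In this toy case f is larger at the low value while g is larger at the
high value, and the values of g do not sum to one, which is why g(x) is
not a probability distribution in the regular sense.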

As a general observation we note that, where the difference between
l(x) and f(x) is considerable, we might expect to see a workload
dependency in the failures. If l(x) and f(x) are similar, the
relationship is probably not significant. A g(x) distribution weighted
in favor of higher workload values implies a higher risk of failure as
the load increases.

It would appear from the g(x) plots for SYSCPU and IO that higher
values of these measures (> 50 for IO) contribute more significantly to
hard failures than the lower values. Examining the plots for TOTCPU, we
note that, as measured by CPU utilization, the system was heavily loaded
most of the time. The l(x) and g(x) plots for TOTCPU show considerable
similarity. It would therefore appear from this cursory analysis that
failures are not induced by higher execution rates, as measured by CPU
usage alone.

In order to quantify this effect, and in particular to determine exactly
the risk or "hazard" associated with higher workload values, we employed
what we refer to as a "load hazard" model, the development and
application of which are discussed in the next section.

5.  THE  LOAD  HAZARD  MODEL

The  object  of  the  analysis  was  to  determine: 

1.  Does  a  higher  level  of  system  utilization  result  in  a  higher  risk 
of  failure  than  a  lower  level? 

2.  Is  the  relationship  linear  with  the  workload  variables,  or  is 
there  a  nonlinear  increasing  effect? 

In  practical  terms,  if  such  an  effect  exists,  it  is  expected  that  the 
load  will  act  as  a  stress  factor.  For  this  purpose  we  developed  and 
validated  a  load-hazard  model  which  formed  the  basis  for  our  tests.  A 
detailed  description  of  the  development  and  validation  of  this  model 
appears  in  [Iyer  82b].  Briefly,  an  inherent  load  hazard  z(x)  is  defined 
as 


$$ z(x) = \frac{\Pr\{\text{failure in load interval } (x, x+\Delta x)\}}{\Pr\{\text{no failure in load interval } (0, x)\}} \tag{1} $$


In close analogy with the classical hazard rate in reliability theory
[Shooman 68], z(x) measures the incremental risk involved in increasing
the workload from x to x+Δx* (e.g., if the system is currently operating
at 80 percent of full load, as measured by CPU usage, what is the
increase in the risk of failure if the load is increased to 90
percent?).


* In applying the load hazard model to our data we made a simplifying
assumption that the workload monotonically increases until failure
occurs. This is a conservative assumption which was made primarily to
simplify some cumbersome aspects of the data analysis. It has the
additional advantage of allowing us to estimate a lower bound on the
workload-related risk (if any). This is due to the fact that, under
the assumption of a monotonically increasing workload, factors such as
cycling (between low and high usage) and other random variations are
ignored. It is well known that such stresses only serve to add to the
hazard rate [Kujowski 78], [Arsenault 80]. Thus by neglecting them we
underestimate the hazard being measured.

The numerator of z(x) was determined from g(x). The survival
probability in the denominator (i.e., the probability of no failure in
the load interval (0, x)) was, for practical purposes, found to be very
close to the probability of reaching a given workload or higher
(determined from the workload distribution l(x)). This is simply because,
in our data, failure events are far fewer than the five-minute workload
samples. Consequently, most often, when a given workload is reached no
failure has occurred (i.e., failures are quite infrequent).
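
The approximation just described can be sketched as follows. This is a
minimal illustration, not the procedure of [Iyer 82b]: it assumes that
the numerator of (1) is taken to be g(x) itself and that the denominator
is replaced by the probability of reaching load x or higher. The
function name and the numbers are hypothetical.

```python
def fundamental_hazard(l, g):
    """Estimate z(x) on a discrete load grid.

    Assumptions of this sketch: the numerator of (1) is taken to be
    g(x) itself, and the survival probability in the denominator is
    approximated by Pr{load >= x}, computed from l(x).
    """
    z = {}
    tail = 0.0  # running Pr{load >= x}, accumulated from the top bin down
    for x in sorted(l, reverse=True):
        tail += l[x]
        z[x] = g[x] / tail
    return dict(sorted(z.items()))

# Hypothetical load distribution and conditional failure probabilities.
l = {10: 0.4, 30: 0.3, 50: 0.2, 70: 0.1}
g = {10: 0.001, 30: 0.002, 50: 0.004, 70: 0.008}
z = fundamental_hazard(l, g)
```

Because the tail probability shrinks as x grows, an increasing g(x) is
amplified in z(x) at the highest load values.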

If z(x) increases with x, it should be clear that there is an
increasing risk of a failure as the workload variable increases. If,
however, z(x) remains constant for increasing x, we may surmise that no
increased risk is involved.

Note that in our definition of load hazard we have removed the
variability of system load by using the conditional probability g(x).
This of course is not true in practice, since load is best described as
a random variable with a probability distribution; it is simply the
associated load distribution, l(x), defined above. In order to determine
the hazard for a particular load pattern, we must multiply the
associated load probability by the hazard calculated in (1). Denoting
by z*(x) the transformed hazard, we have

$$ z^{*}(x) = z(x)\, l(x) \tag{2} $$

We refer to the hazard z(x), as defined in (1), as the fundamental
hazard. This is because it can be thought of as an inherent property of
a particular system and is not subject to varying load patterns. When a
varying load pattern is taken into account, it can be thought of as
"picking out" aspects of the fundamental hazard function. This hazard
z*(x), defined in (2), will be referred to as the apparent hazard, since
it is closely dependent on the load distribution.
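
A small numerical sketch of equation (2) shows how strongly the
apparent hazard depends on the load distribution. The hazard and load
values below are invented for illustration only.

```python
def apparent_hazard(z, l):
    """Apply equation (2): z*(x) = z(x) * l(x) for each load bin."""
    return {x: z[x] * l[x] for x in sorted(z)}

# Case 1: a flat fundamental hazard combined with a load distribution
# weighted toward high loads gives an increasing apparent hazard.
flat_z = {10: 0.01, 30: 0.01, 50: 0.01}
rising_l = {10: 0.1, 30: 0.3, 50: 0.6}
za_flat = apparent_hazard(flat_z, rising_l)

# Case 2: an increasing fundamental hazard combined with a load
# distribution concentrated at low loads gives a decreasing apparent
# hazard.
rising_z = {10: 0.01, 30: 0.02, 50: 0.04}
falling_l = {10: 0.8, 30: 0.15, 50: 0.05}
za_rise = apparent_hazard(rising_z, falling_l)
```

In both cases the shape of z*(x) is set by l(x) rather than by z(x)
alone, which is why the two hazards are kept distinct.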

6.  HAZARD  PLOTS 

The generation of the hazard plots and associated statistics involved
extensive data processing. In each hazard plot, z(x) or z*(x) is
calculated and plotted as a function of a chosen workload variable, x.
In developing hazard plots for the load-failure data, there is an
important difference between the real and the artificially created
data. This lies in the fact that, while an artificial data base has
specific dependencies seeded into it, in the real world failures can
occur due to a number of causes. Examples are temperature, humidity,
random noise, mechanical failures, and design errors, some of which are
unrelated to our study. Those factors not related to load can be
expected to behave as noise in a load-failure analysis. If these other
factors are predominant, we can expect to find no discernible pattern in
our hazard plots, i.e., they should appear as uncorrelated clouds. This
is well understood in any statistical study of dependencies.
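
The idea of an artificial data base with a seeded dependency can be
sketched as follows. This is only an illustration of the concept: the
generator, its parameters, and the way load-independent "background"
failures are added as noise are assumptions of the sketch, not a
description of the data base actually used for validation.

```python
import random

def seeded_failure_samples(n, fail_prob_by_load, background=0.0, seed=0):
    """Generate (load, failed) samples with a known load dependency.

    fail_prob_by_load maps each load bin to the seeded failure
    probability; `background` adds load-independent failures (noise).
    All names and values here are assumptions of this sketch.
    """
    rng = random.Random(seed)
    bins = sorted(fail_prob_by_load)
    samples = []
    for _ in range(n):
        load = rng.choice(bins)
        p = min(1.0, fail_prob_by_load[load] + background)
        samples.append((load, rng.random() < p))
    return samples

def conditional_failure_rate(samples):
    """Empirical g(x): failures at load x divided by samples at load x."""
    totals, fails = {}, {}
    for load, failed in samples:
        totals[load] = totals.get(load, 0) + 1
        fails[load] = fails.get(load, 0) + int(failed)
    return {x: fails[x] / totals[x] for x in sorted(totals)}

seeded = {20: 0.01, 40: 0.02, 60: 0.05, 80: 0.20}
g_hat = conditional_failure_rate(
    seeded_failure_samples(20000, seeded, background=0.005))
```

If the analysis is sound, the recovered g_hat should track the seeded
probabilities; raising `background` relative to the seeded values is
one way to see the pattern dissolve into noise.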

An easily discernible pattern, on the other hand, would indicate that
the load-failure dependency dominates others. The strength of such a
relationship can be measured through regression. Figures 10, 11, and 12
depict the hazard plots for the three selected load parameters. The
coefficient of determination R^2, which is an effective measure of the
goodness of fit, is provided for each plot. Quite simply, it measures
the proportion of variability in the data that can be accounted for by
the regression model. R^2 values greater than 0.6 (corresponding to
R > 0.75) are generally interpreted as strong relationships (see
footnote 7) [Younger 79]. It can be seen that the hazards are increasing
with each of the load parameters shown. The relationship is particularly
strong with SYSCPU and IO, although other measures such as EXCP, SVC,
and PROG (plots not shown) also correlate strongly. Note that these
measures in one way or another measure the interactive workload, with
some degree of overlap, and have different degrees of variability.
TOTCPU, a general measure of execution, also correlates moderately
strongly. In addition, it is seen that the workload-failure relationship
is highly non-linear. This appears to point toward the existence of a
threshold beyond which the system degrades very rapidly.
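
As a rough illustration of how R^2 is obtained for a hazard plot, the
sketch below fits a least-squares line to a set of (load, hazard)
points and reports the fraction of variability explained. The hazard
values are made up and convex in the load; they are not read from
Figures 10 to 12, and the function name is hypothetical.

```python
def linear_fit_r2(xs, ys):
    """Least-squares line y = a + b*x and its coefficient of determination."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope
    a = my - b * mx                    # intercept
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot

# Made-up hazard values that rise slowly and then sharply with load.
loads = [10, 20, 30, 40, 50, 60, 70, 80]
hazard = [0.001, 0.001, 0.002, 0.003, 0.005, 0.009, 0.016, 0.030]
a, b, r2 = linear_fit_r2(loads, hazard)
strong = r2 > 0.6  # threshold quoted in the text
```

For data of this shape a straight line already clears the 0.6
threshold, while the variability it leaves unexplained reflects the
curvature, i.e., the non-linear growth of the hazard at high load.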

It is interesting to note that most of the estimated hardware
failures were failures in main memory. An analysis of these failures by
time of day showed that they generally occur during the period when the
main memory access rate and the interactive workload measures (e.g.,
SYSCPU and IO) are the highest (i.e., during prime time).*


7 The range of |r| from 0 to 1 is typically divided as follows: (0,
0.25) moderately weak; (0.25, 0.5) moderate; (0.5, 0.75) moderately
strong; (0.75, 1.0) strong.

* We also note that, at SLAC, most of the heavy compute-bound jobs,
which require relatively low overhead, are run at night, when the paging
rate and consequently the main memory access rate are relatively low.

7.  DISCUSSION  AND  CONCLUDING  REMARKS

The analysis shows that there is a strong load dependency of internal
CPU errors at SLAC. The observed tendency is present in the three years
of load data analyzed. This is significant because our previously
reported results could only provide us with an external view of
permanent system and component failures. By examining the CPU error
generation process we have been able to study the inner behavior of the
system and its reaction to errors. Consequently, we have gathered the
best data possible. A load-failure relationship found at this level
must, in our view, be a fundamental phenomenon.

The analysis procedure has been demonstrated on an artificially created
data base seeded with failures. The two hazard models proposed clearly
differentiate between fundamental (or inherent) and apparent
load-dependent failures. An estimate of the fundamental hazard, z(x),
provides the basic load-failure relationship. The apparent hazard z*(x)
estimates how z(x) is modified by the load probabilities. It is, in
principle, possible that even when no inherent relationship exists
between load and failures, we could obtain an apparent dependency simply
because some load values occur more frequently than others.
Alternatively, we can have the reverse situation, where an increasing
fundamental hazard is transformed into a non-increasing or even
decreasing apparent hazard by a distinctive load distribution.

As with any statistical analysis, this is not proof in itself.
More measurements and experiments are necessary to study this problem
further. However, the increasing body of evidence accumulated on
different computers with differing load and failure patterns shows that
workload should be considered as a factor relating to reliability.
Workload can be thought of as a stress on the system, with greater
stresses resulting in greater risk of failure. In view of our previous
results, we believe that the error process which ensues is composed of
two separate effects. The first is the (constant) inherent failure
rate. This is determined through classical reliability techniques
[Shooman 68], taking into consideration such factors as topology,
redundancy, etc. The second is the utilization-induced failure rate.
This rate is dependent upon both the absolute level of system
utilization and the rate of change of that level. By an absolute level
we mean an obviously measurable level, e.g., CPU utilization, memory
occupancy, etc. Through the rate of change of utilization we are
attempting to measure the rate at which transitions occur between
various system states, e.g., the transitions of the CPU into and out of
the busy state. In most cases the effect of this stress is not
permanent, since most errors are transient [Iyer 82b]. However, as
demonstrated in this paper, there is a significant contribution due to
hard device failures in the CPU and main storage.

A preliminary examination of the semiconductor device literature
shows that some experimental and quantitative evidence exists to
support our results. For example, the effect of transient and
intermittent loading on the rating of power devices has been studied at
length (see [Ivalo 61] and [Blackburn 74] for details). It is well known
that the duty cycle of the pulse is an important parameter in the rating
of these devices for pulsed operation. [Owen 80] describes practical
methods commonly employed to evaluate the thermal effects of repetitive
pulsed loading. Detailed analytical and experimental analysis of both
steady-state and transient thermal behaviour is discussed in [Newell
75].

There is also evidence in the general reliability literature which
relates low and high usage rates of avionic and navigational equipment
to the corresponding reliability behaviour (see [Shurman 78] and
[Kujowski 78] for details). It is to be noted that in each of these two
studies a significant component of the system was electronic or digital.
Our measurements show that the effect is not negligible in smaller
devices. The design of computer systems will be greatly aided if this
type of analysis can help uncover cause-and-effect relationships in
hardware errors.